Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations

Authors

Abstract

Visual relationship detection aims to reason over relationships among salient objects in images, which has drawn increasing attention over the past few years. Inspired by human reasoning mechanisms, it is believed that external visual commonsense knowledge is beneficial for reasoning visual relationships of objects in images, which is, however, rarely considered in existing methods. In this paper, we propose a novel approach named Relational Visual-Linguistic Bidirectional Encoder Representations from Transformers (RVL-BERT), which performs relational reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training with multimodal representations. RVL-BERT also uses an effective spatial module and a novel mask attention module to explicitly capture spatial information among the objects. Moreover, our model decouples object detection from visual relationship recognition by taking in object names directly, enabling it to be used on top of any object detection system. We show through quantitative and qualitative experiments that, with the transferred knowledge and novel modules, RVL-BERT achieves competitive results on two challenging visual relationship detection datasets. The source code is available at https://github.com/coldmanck/RVL-BERT.
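As a rough illustration of the decoupled design described in the abstract, the sketch below is hypothetical code, not the released RVL-BERT implementation; all module names, feature dimensions, and mask sizes are assumptions. It feeds object names as text tokens together with region visual features and binary spatial masks into a small transformer encoder that classifies the predicate, so any upstream detector could supply the boxes and labels.

# Hypothetical sketch (not the authors' code): object *names* plus region
# visual features and binary spatial masks go into a transformer encoder,
# which predicts the predicate between the two objects.
import torch
import torch.nn as nn

class RelationshipClassifierSketch(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, num_predicates=70):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)     # subject/object name tokens
        self.visual_proj = nn.Linear(2048, d_model)           # e.g. ResNet region features
        self.spatial_proj = nn.Linear(32 * 32, d_model)       # flattened binary spatial masks
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model, num_predicates)  # predicate logits

    def forward(self, name_ids, region_feats, spatial_masks):
        # name_ids: (B, 2) token ids for the subject and object names
        # region_feats: (B, 2, 2048) visual features of the two boxes
        # spatial_masks: (B, 2, 32, 32) binary masks marking box locations
        tokens = torch.cat(
            [self.word_emb(name_ids),
             self.visual_proj(region_feats) + self.spatial_proj(spatial_masks.flatten(2))],
            dim=1)                                            # (B, 4, d_model)
        fused = self.encoder(tokens)
        return self.classifier(fused.mean(dim=1))             # pool and classify

model = RelationshipClassifierSketch()
logits = model(torch.randint(0, 1000, (1, 2)),
               torch.randn(1, 2, 2048),
               torch.zeros(1, 2, 32, 32))
print(logits.shape)  # torch.Size([1, 70])

Because the classifier only sees names, features, and masks, swapping the object detector does not require retraining the relationship head, which is the practical benefit of the decoupling the abstract points to.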


Related articles

Phrase Localization and Visual Relationship Detection with Comprehensive Linguistic Cues

This paper presents a framework for localization or grounding of phrases in images using a large collection of linguistic and visual cues. We model the appearance, size, and position of entity bounding boxes, adjectives that contain attribute information, and spatial relationships between pairs of entities connected by verbs or prepositions. We pay special attention to relationships between pe...


Supplementary Materials of Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation

Spatial features have been demonstrated to be helpful in visual tasks such as object detection, image retrieval, and semantic segmentation [13, 12, 9, 6, 1, 2, 7]. For visual relationship detection task, spatial features such as the relative location and size of two objects are informative for predicate prediction. For example, relative location is a discriminative feature for predicates “under...
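As an illustration of the kind of spatial cue this snippet describes, the short sketch below computes relative location and relative size features from two bounding boxes. It is an assumed feature design for demonstration, not the paper's exact formulation.

# Illustrative sketch (assumed feature design): relative location and size of
# a subject box with respect to an object box, the kind of cue that helps
# separate predicates such as "under" from "above".
def spatial_features(subj_box, obj_box):
    """Boxes are (x1, y1, x2, y2) in pixels."""
    sx1, sy1, sx2, sy2 = subj_box
    ox1, oy1, ox2, oy2 = obj_box
    sw, sh = sx2 - sx1, sy2 - sy1
    ow, oh = ox2 - ox1, oy2 - oy1
    return [
        (sx1 - ox1) / ow,        # horizontal offset, normalized by object width
        (sy1 - oy1) / oh,        # vertical offset, normalized by object height
        (sw * sh) / (ow * oh),   # relative area
        sw / ow,                 # relative width
        sh / oh,                 # relative height
    ]

# A subject located below the object yields a large positive vertical offset.
print(spatial_features((120, 300, 220, 400), (100, 100, 260, 260)))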


On the Relationship between Linguistic, Visual-Spatial, and Interpersonal Intelligences and Technical Translation Quality

This study tried to investigate whether there was any significant relationship between the technical translation quality of senior English translation students and their levels of verbal-linguistic, visual-spatial and interpersonal intelligences. In order to investigate the research questions, the researcher selected a hundred senior English translation students from three universitie...

Extracting Visual Knowledge from the Web with Multimodal Learning

We consider the problem of automatically extracting visual objects from web images. Despite the extraordinary advancement in deep learning, visual object detection remains a challenging task. To overcome the deficiency of pure visual techniques, we propose to make use of meta text surrounding images on the Web for enhanced detection accuracy. In this paper we present a multimodal learning algor...


A Critical Visual Analysis of Gender Representation of ELT Materials from a Multimodal Perspective

This content analysis study, employing a multimodal perspective and critical visual analysis, set out to analyze gender representations in Top Notch series, one of the highly used ELT textbooks in Iran. For this purpose, six images were selected from these series and analyzed in terms of ‘representational’, ‘interactive’ and ‘compositional’ modes of meanings. The result indicated that there are...



Journal

Journal: IEEE Access

Year: 2021

ISSN: 2169-3536

DOI: https://doi.org/10.1109/access.2021.3069041